Optimizing Sequential Models: RNN, LSTM and GRU
In our last blog (Comparing Sequential Models) we saw that our sequential models (RNN, LSTM and GRU) converged and behaved almost identically when comparing multiple stacked-layer architectures, with nearly the same rate of convergence. But thinking about it again, we did not perform any preprocessing steps like feature selection or dataset trimming, and while modeling we did not consider using batch or layer normalization, which could easily have made a huge impact on our rate of convergence.
So, in this blog post we are going to look into some of those techniques and evaluate how impactful they are.
But before moving on, let's revisit our data!
Data Recap!
We will continue working on the same data from our last post: the price of Ethereum. Let's have a quick overview of it. The data is recorded daily, starting from August 7, 2015 and running to August 16, 2025, making a total of 3,634 records. As for features, we have Date, Open, High, Low, Close, Volume, Volume(ETH), and Market Cap.
Let's take a look to verify whether the data we are working with is "OK for our models" or too much for us to handle.
(Note: Again, to make our model struggle, we will continue working with our large dataset (in terms of number of records; not to be confused with the following step, feature selection) and will not trim it.)
Hmm... I thought the more features we have for training, the better our model gets, right? NOPE! Every model has a specific architecture, which decides what kind of input data suits it. And clearly, we can see that our simple-architecture sequential models are not so good at handling all the features and finding the underlying pattern among them. But when we trim down our features, our model has less to deal with, and can find deeper patterns within what remains. (Note: We have to be careful when removing columns, because we must not remove important ones that might be helpful for our prediction.)
In our case, we can observe that not only were we able to converge at a lower point compared to our model utilizing all of the features, but the loss also keeps shrinking further and further even after 2000 epochs. So now we know that we should not take more than what we need for training. From now on, we will be modeling based only on our Close price.
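To make the trimming step concrete, here is a minimal sketch of the univariate setup. The `make_windows` helper and the 30-step lookback are my illustrative choices, not code from the original pipeline, and a synthetic stand-in series is used in place of the real Close column:

```python
import numpy as np
import pandas as pd

def make_windows(series: np.ndarray, lookback: int = 30):
    """Slice a 1-D price series into (input window, next value) pairs."""
    X = np.stack([series[i : i + lookback] for i in range(len(series) - lookback)])
    y = series[lookback:]
    return X, y

# Stand-in for the real dataframe: just a synthetic Close column.
df = pd.DataFrame({"Close": np.linspace(1.0, 100.0, 200)})

# Keep only the Close price -- the single feature we train on from now on.
close = df["Close"].to_numpy(dtype=np.float32)

X, y = make_windows(close, lookback=30)
print(X.shape, y.shape)  # (170, 30) (170,)
```

Each row of `X` is one sliding window of past prices, and the matching entry of `y` is the next price the model should predict.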
Now let's also see whether this new change has any impact on the results from our previous post.
These results look a little better than our previous approach using all the features. Let's compare them side-by-side.
We can observe that the performance gains are significant, especially for smaller architectures, and as we keep increasing the number of stacked layers, the gap narrows until the models provide similar performance.
So, from this point onwards, we are going to use only the Close price, instead of all the features, to predict the value.
Training Optimization
In the world of deep learning, training a neural network can sometimes feel like trying to hit a moving target in the dark. As your network learns, the distribution of the inputs to each layer can change wildly depending on the scale of the features, a problem known as internal covariate shift. This instability makes training slow and difficult. To solve this, we use powerful techniques called Batch Normalization ($\mathcal{BN}$) and Layer Normalization ($\mathcal{LN}$).
Both methods work by normalizing the inputs to a layer, but they do so in fundamentally different ways.
Batch Normalization (BN): The Crowd-Pleaser
Batch Normalization normalizes the output of a previous layer by calculating the mean and standard deviation across the entire batch of training examples for each feature.
Imagine a classroom where every student takes a test. To "batch normalize" the scores, you would calculate the average score and standard deviation for a single question (a feature) across all students (the batch). You'd then use these stats to standardize each student's score for that specific question.
Mathematically, for a feature i, the mean (${\mu}_i$) and variance (${\sigma}_i^2$) are calculated across all examples in the batch. Each feature value $x_i$ is then normalized:
$$\hat{x}_i = \dfrac{x_i - \mu_i}{\sqrt{\sigma_i^2 + \epsilon}}$$
Where $\epsilon$ is a small constant for numerical stability. The network then learns two more parameters, gamma ($\gamma$) and beta ($\beta$), to scale and shift this normalized value, allowing the network to learn the optimal distribution.
$$y_i = \gamma \cdot \hat{x}_i + \beta $$
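Putting the two formulas together, here is a small NumPy sketch of batch normalization in its simplest form. In a real network $\gamma$ and $\beta$ are learned; here they are fixed at 1 and 0 just to show the statistics:

```python
import numpy as np

def batch_norm(x, gamma, beta, eps=1e-5):
    """Normalize each feature (column) using statistics over the batch (rows)."""
    mu = x.mean(axis=0)             # per-feature mean across the batch
    var = x.var(axis=0)             # per-feature variance across the batch
    x_hat = (x - mu) / np.sqrt(var + eps)
    return gamma * x_hat + beta     # learned scale and shift

# Toy batch: 4 examples, 3 features.
x = np.array([[1., 2., 3.],
              [2., 4., 6.],
              [3., 6., 9.],
              [4., 8., 12.]])
out = batch_norm(x, gamma=np.ones(3), beta=np.zeros(3))
print(out.mean(axis=0))  # ~0 for every feature
print(out.std(axis=0))   # ~1 for every feature
```

After normalization, every feature column has roughly zero mean and unit variance across the batch, exactly as the formula above prescribes.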
Why we use it:
- Speeds up training: By stabilizing the input distributions, it allows for higher learning rates.
- Reduces sensitivity to initialization: The network becomes less dependent on the initial random weights.
- Acts as a regularizer: The noise introduced by the batch statistics can have a slight regularizing effect, sometimes reducing the need for dropout.
When to use it:
Batch Normalization is the go-to choice for Convolutional Neural Networks (CNNs) and other feed-forward networks where the batch size is sufficiently large and the inputs are independent. It's a cornerstone of modern computer vision models like ResNet.
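For sequence models, one way to wire BN between stacked recurrent layers looks like the following PyTorch sketch. The layer sizes, placement, and class name are my illustrative assumptions, not the exact architecture from this post:

```python
import torch
import torch.nn as nn

class LSTMWithBN(nn.Module):
    """Stacked LSTM with BatchNorm1d applied between the recurrent layers."""
    def __init__(self, hidden=32):
        super().__init__()
        self.lstm1 = nn.LSTM(1, hidden, batch_first=True)
        # BatchNorm1d expects (batch, channels, seq_len), so we transpose around it.
        self.bn = nn.BatchNorm1d(hidden)
        self.lstm2 = nn.LSTM(hidden, hidden, batch_first=True)
        self.head = nn.Linear(hidden, 1)

    def forward(self, x):                       # x: (batch, seq_len, 1)
        h, _ = self.lstm1(x)                    # (batch, seq_len, hidden)
        h = self.bn(h.transpose(1, 2)).transpose(1, 2)
        h, _ = self.lstm2(h)
        return self.head(h[:, -1])              # predict the next Close price

model = LSTMWithBN()
x = torch.randn(8, 30, 1)                       # batch of 8 windows, 30 steps each
print(model(x).shape)                           # torch.Size([8, 1])
```

Note that the BN statistics here are computed across whichever windows happen to land in the same mini-batch, which is exactly the source of the instability discussed below.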
(Author's Note: The data might be a little tough to see because of some very large spikes, so please zoom in to view each segment clearly.)
Initially, we converge really fast, but through the rest of training we keep fluctuating, and the same can be seen in testing. This is because each mini-batch contains some records where the price of Ethereum is low and others where it is very high. So, most of the time, the normalized values will be far from the actual values, making training and testing very difficult. That's why our model learns quickly in the early epochs, but afterwards, when it tries to learn deeper patterns, it faces trouble: those patterns don't exist in reality, they are artifacts of the batch-normalized values, which depend on the formation of each batch, and that formation is random. That's why our model starts struggling.
Layer Normalization (LN): The Individualist
Layer Normalization, on the other hand, computes the mean and standard deviation across all the features for a single training example. It doesn't care what the other examples in the batch are doing.
Back to our classroom analogy: to "layer normalize" a student's test, you would calculate the average score and standard deviation for that one student across all the questions. You'd then use these personal stats to normalize that student's score on each question.
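The analogy translates directly into code: each row (one "student") is normalized using only its own statistics, so two examples on wildly different scales end up looking identical after normalization. This is a bare-bones NumPy sketch with no learned scale or shift:

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """Normalize each example (row) across its own features (columns)."""
    mu = x.mean(axis=1, keepdims=True)   # each student's own average score
    var = x.var(axis=1, keepdims=True)   # each student's own spread
    return (x - mu) / np.sqrt(var + eps)

# Two "students", four "questions": the second row is 100x the first,
# yet each row is normalized using only its own statistics.
scores = np.array([[1., 2., 3., 4.],
                   [100., 200., 300., 400.]])
out = layer_norm(scores)
print(out)  # both rows normalize to (nearly) the same values
```

Because the statistics never cross example boundaries, the low-price and high-price Ethereum records that confused BN would each be normalized on their own terms here.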
Why we use it:
- Works with any batch size: Since it operates on a single example at a time, it's effective even with a batch size of one.
- Excellent for sequential data: It's highly effective in Recurrent Neural Networks (RNNs) and Transformers, where the sequence length can vary, making batch-level statistics unreliable.
When to use it:
Layer Normalization is the standard for RNNs, LSTMs, and Transformers. It has been a key ingredient in the success of models like BERT and GPT, which process sequential data (natural language).
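A PyTorch sketch of one common LayerNorm placement in this kind of stacked model (again, the sizes and exact placement are illustrative assumptions, not the post's exact architecture):

```python
import torch
import torch.nn as nn

class LSTMWithLN(nn.Module):
    """Stacked LSTM with LayerNorm applied between the recurrent layers."""
    def __init__(self, hidden=32):
        super().__init__()
        self.lstm1 = nn.LSTM(1, hidden, batch_first=True)
        # LayerNorm normalizes over the last dim, so no transposing is needed.
        self.ln = nn.LayerNorm(hidden)
        self.lstm2 = nn.LSTM(hidden, hidden, batch_first=True)
        self.head = nn.Linear(hidden, 1)

    def forward(self, x):              # x: (batch, seq_len, 1)
        h, _ = self.lstm1(x)
        h = self.ln(h)                 # per-timestep, per-example statistics
        h, _ = self.lstm2(h)
        return self.head(h[:, -1])

model = LSTMWithLN()
# LN's statistics never depend on the batch, so even a single example
# is normalized identically to one inside a full batch.
print(model(torch.randn(1, 30, 1)).shape)  # torch.Size([1, 1])
```

Compared with the BN variant, no transposing is needed and no batch-dependent randomness leaks into the normalized values.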
(Author's Note: The data might be a little tough to see because of some very large spikes, so please zoom in to view each segment clearly.)
As we can see, after adding layer normalization our model converges faster and is much more stable, compared both to our plain stacked models and to the ones with Batch Normalization applied.